A string is a sequence of zero or more characters, and one of the most frequently used types in programming. It is therefore fitting that we acquaint ourselves with the idea of operating on strings.
You might already be familiar with string and character literals from the introductory chapter or from other programming languages.
A string literal is surrounded by double quotes: "string". Within the string, you can escape a double quote using a backslash:
In [1]:
"This string contains a \" double quote \" "
Out[1]:
Strings are immutable and indexable – indices return the characters at the index position, starting from 1.
String and character literals are differentiated by two indicia:
- a Char object always holds exactly one character, whereas a String may have any length, including zero (""),
- Char literals are introduced by single apostrophes (''), String literals by double quotes ("").
In [7]:
"This is a string."
Out[7]:
In [8]:
'T' # This is a Char
Out[8]:
In [9]:
"T" # This is a String of length 1
Out[9]:
In [10]:
'This is a Char' # Will throw an error; a Char can only be of length 1
The second of these tends to be somewhat vexing for many programmers who are used to the equivalence of '' and "" in languages that do not necessarily have an implemented type or class for characters mirroring Char.
So while, for instance, 'a' == "a" holds in Python, this is not the case in Julia:
In [2]:
typeof("a")
Out[2]:
In [3]:
typeof('a')
Out[3]:
In [4]:
"a" == 'a'
Out[4]:
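The distinction can be checked directly. The following is a minimal sketch (the variable names are purely illustrative) showing that a Char and a one-character String are different types, and how one might move between the two:

```julia
c = 'a'    # a Char
s = "a"    # a String of length 1

types_differ = typeof(c) != typeof(s)   # Char vs String
equal = (c == s)                        # false: a Char is never equal to a String

# Converting one into the other makes the comparison meaningful
as_string = string(c)    # "a"
first_char = s[1]        # 'a': indexing with a single index yields a Char
```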
In [13]:
multiline_declaration = """
We hold these truths to be self-evident,
that all men are created equal,
that they are endowed by their Creator with certain unalienable Rights,
that among these are Life, Liberty and the pursuit of Happiness.
That to secure these rights, Governments are instituted among Men,
deriving their just powers from the consent of the governed...
"""
print(multiline_declaration)
As you can see, the use of the """ or 'heredoc' format has preserved the line breaks and structure of the text, a rather helpful feature where longer texts are concerned.
Regular expressions (regexes) are special strings that represent particular patterns.
They are useful in matching and searching text, and a solid grasp of regexes is essential for any good functional programmer.
To construct a regex literal, preface the string with r:
In [14]:
regex_literal = r"a|e|i|o|u"
Out[14]:
This is a regex literal that matches (English) vowels. Julia recognises regex literals as the type Regex:
In [16]:
typeof(regex_literal)
Out[16]:
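A regex does not have to be written as a literal. As a small sketch, the Regex() constructor builds the same kind of object from an ordinary string, which may be handy when the pattern is only known at run time (the variable names here are illustrative):

```julia
vowels = r"a|e|i|o|u"                 # regex literal
from_string = Regex("a|e|i|o|u")      # the same pattern, built from a String

is_regex = vowels isa Regex                            # true
same_pattern = vowels.pattern == from_string.pattern   # both hold "a|e|i|o|u"
```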
In [17]:
declaration = "When in the Course of human events"
Out[17]:
In [18]:
declaration[1:4] # Get the substring from range 1 to 4
Out[18]:
You might recall that a range can have a step attribute, which we can use to obtain every _n_th letter within a text.
Let's see every odd-numbered letter within the first few words of the Declaration of Independence:
In [21]:
declaration[1:2:end] # Get the substring from range 1 to end using steps of 2
Out[21]:
You might remember that end, which we used above to extend the range across the entire length of the string, behaves like a number. Therefore, you can use it to create a substring that excludes the last, say, five letters:
In [22]:
declaration[1:end-5] # Get the substring from range 1 to end-5
Out[22]:
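The slicing rules are easier to follow on a short string. This sketch uses a made-up eight-letter string and assumes plain ASCII text; string indices in Julia are byte positions, so with multi-byte characters not every integer is a valid index:

```julia
s = "abcdefgh"

first_four = s[1:4]          # "abcd"
every_other = s[1:2:end]     # "aceg": indices 1, 3, 5 and 7
all_but_five = s[1:end-5]    # "abc": end is 8 here, so the range is 1:3
last_char = s[end]           # 'h': a single index returns a Char, not a String
```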
In many programming languages, maths and string operations correspond, so you can use + to concatenate and * to repeat a string.
This is not the case in Julia. + has no method for strings. What you would expect + to do is accomplished by *, and what you would expect * to do is accomplished by ^:
In [25]:
"I" * " <3 " * "Julia"
Out[25]:
In [26]:
"I will not say bad things about functional languages again. " ^ 10
Out[26]:
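To make the operator mapping concrete, here is a short sketch. The try/catch block is only there to show, without aborting the session, that + really has no method for two strings:

```julia
greeting = "I" * " <3 " * "Julia"        # concatenation: "I <3 Julia"
echo = "ab" ^ 3                          # repetition: "ababab"
joined = string("I", " <3 ", "Julia")    # string() concatenates its arguments as well

# + is not defined for strings, so the attempt raises a MethodError
plus_fails = try
    "I" + "Julia"
    false
catch err
    err isa MethodError
end
```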
The split() function separates a piece of text at a particular delimiter, which it also removes.
The result is an array of the chunks. By default, split() will separate at whitespace, but you can provide any other string – not even necessarily a single character, as the third example shows:
In [27]:
split(declaration) # Split at whitespace (the default)
Out[27]:
In [28]:
split(declaration, "e") # Split by the character: 'e'
Out[28]:
In [31]:
split(declaration, "the") # Split by the string: 'the'
Out[31]:
If you provide "" as the string to split at, Julia will split the text into individual letters.
You may also split your text at a regex:
In [32]:
regex_literal = r"a|e|i|o|u"
split(declaration, regex_literal) # Split by the regex: "a|e|i|o|u" (any vowels)
Out[32]:
Needless to say, since strings are immutable, the original string is not affected by the application of split().
In [34]:
print(declaration) # Original string is unchanged
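The different delimiters are easiest to compare on a short sample. In this sketch the sample text is made up, and the results in the comments are what Julia 1.x returns, where empty chunks are kept by default:

```julia
text = "the cat sat"

by_space = split(text)                 # ["the", "cat", "sat"]
by_char = split(text, "a")             # ["the c", "t s", "t"]
by_string = split(text, "at")          # ["the c", " s", ""]
by_regex = split(text, r"[aeiou]")     # ["th", " c", "t s", "t"]
letters = split("abc", "")             # ["a", "b", "c"]
```

Note the trailing empty chunk in by_string: the text ends with the delimiter, so what follows it is an empty string.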
In [37]:
love = "<3"
Out[37]:
In [38]:
"I " * love * " Julia"
Out[38]:
While this is technically correct, it is much quicker to write using string interpolation, in which case we would refer back to the variable love as $(love) within the string.
Julia knows this means it is to replace $(love) with the contents of the variable love:
In [39]:
"I $(love) Julia" # Return the variable defined in $()
Out[39]:
You can put anything within the parentheses in string interpolation – anything Julia knows how to handle. For instance, including an expression in a string, you get
In [40]:
"Three plus four is $(3+4)." # Evaluate the expression inside $() and insert its result
Out[40]:
If, and only if, you are referring to a variable, you can omit the parentheses (but not if you are referring to an expression):
In [41]:
"I $love Julia" # Return the variable defined in $
Out[41]:
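The interpolation forms can be placed side by side. In this sketch, n and the variable names are illustrative additions; the last line shows that a literal dollar sign has to be escaped with a backslash:

```julia
love = "<3"
n = 3

bare = "I $love Julia"                      # variable without parentheses
wrapped = "I $(love) Julia"                 # variable with parentheses
expr = "Three plus four is $(n + 4)."       # an expression needs the parentheses
concatenated = "I " * love * " Julia"       # the same result via concatenation
price = "Price: \$5"                        # \$ yields a literal dollar sign
```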
As mentioned, the main utility of regular expressions (regexes) is to find things within long pieces of text.
In the following, we will introduce the three main regex search functions of Julia (match(), matchall() and eachmatch()) with reference to a bit of the Declaration of Independence:
In [42]:
declaration = "We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security.--Such has been the patient sufferance of these Colonies; and such is now the necessity which constrains them to alter their former Systems of Government. The history of the present King of Great Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid world."
Out[42]:
If you are familiar with regular expressions, plod ahead! However, if
(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})
looks like gobbledygook to you or you feel your regex fu is a little rusty, put down this book and consult the Regex cheatsheet or, even better, Jeffrey Friedl's amazing book on mastering regexes.
In [43]:
search(declaration, "Government")
Out[43]:
search() also accepts regular expressions:
In [46]:
search(declaration, r"th.{2,3}") # Regex translation: Match (th) and any of the next 2 to 3 characters (.{2,3}) after it.
Out[46]:
To retrieve the result, rather than its index, you can pass the resulting index off to the string as the subsetting range, using the square bracket [] syntax:
In [47]:
declaration[search(declaration, r"th.{2,3}")] # Return string with range indices defined by regex
Out[47]:
Ah, so that's the word it found!
Where a search string is not found, search() will yield 0:-1.
In [50]:
search(declaration, r"USSR") # Return results for Communism in the declaration of Independence
Out[50]:
That is an odd result, until you realise the reason: for any string s, s[0:-1] will necessarily yield "" (that is, an empty string).
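Readers on Julia 1.x will find that search() is no longer available there; findfirst() covers the same ground, but it returns nothing rather than 0:-1 when there is no match. A minimal sketch of the same lookups under that assumption, using a shortened sample text:

```julia
# Assumes Julia 1.x, where findfirst() plays the role of search()
text = "We hold these truths"

plain = findfirst("these", text)          # 9:13, the range covered by the match
pattern = findfirst(r"th.{2,3}", text)    # 9:13 as well
found = text[pattern]                     # "these"
not_found = findfirst("USSR", text)       # nothing
```

Because the no-match result is nothing, it has to be checked before it is used as an index.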
The problem with search() is that it retrieves one, and only one, result – the first within the string passed to it.
The match() family of functions can help us with finding more results:
- match() retrieves either the first match or nothing within the text.
- matchall() returns an array of all matching substrings.
- eachmatch() returns an iterator over all matches.
The match() family of functions needs a regular expression literal as a search argument. This is so even if the regular expression does not make use of any pattern matching beyond a simple string. Thus,
In [51]:
match(r"truths", declaration) # The r prefix makes it a Regex type
Out[51]:
is valid, while
In [52]:
match("truths", declaration) # Match does not take just strings
yields an error:
RegexMatch objects
Most regex search functions return an object of type RegexMatch.
As the name reveals, a RegexMatch is a composite type representing a match. As such, it encapsulates (to use a little more OOP terminology than one would normally be allowed to in a book on functional programming) four values, the first three of which will be of immediate interest to us:
- RegexMatch.match is the matched substring.
- RegexMatch.captures is an array of the substrings captured by the capture groups in the regex (it is empty if the regex has no groups).
- RegexMatch.offset is generally an Int64 that represents the index of the first character of the matched string where there is a single match (e.g. when using match()).
To illustrate, let's consider the result of a match() call, which will be discussed in more detail in the next subsection:
In [55]:
m = match(r"That .*?,", declaration) # Regex translation: Match 'That' (That) then a space character,
# followed by a lazy match (least characters) with any characters (.*?)
# until you hit a comma character (,)
Out[55]:
In [56]:
m.match # What was the matching string?
Out[56]:
In [57]:
m.captures # What did the capture groups capture? (Empty here, as the regex has no groups)
Out[57]:
In [58]:
m.offset # Where is the first character of the matched string in the original string?
Out[58]:
In [61]:
declaration[m.offset:(m.offset + length(m.match) - 1)] # Recover the matched substring from the original string using its offset and length
Out[61]:
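The captures field is more informative when the regex actually contains capture groups. The following sketch uses a made-up sample text and a regex with two groups to show how the fields relate to each other:

```julia
text = "pages 10-20 only"
m = match(r"(\d+)-(\d+)", text)    # two capture groups, one per number

whole = m.match          # "10-20": the entire matched substring
groups = m.captures      # ["10", "20"]: one entry per capture group
start = m.offset         # 7: index of the first character of the match
group_starts = m.offsets # [7, 10]: where each captured group begins

# The match can be recovered from the original string via offset and length
recovered = text[start:(start + length(whole) - 1)]
```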
In [68]:
match(r"That .*?,", declaration) # Return the first Regex Match
Out[68]:
The result is a RegexMatch object. The object can be inspected using .match (e.g. match(r"truths", declaration).match).
In [67]:
match(r"That .*?,", declaration).match # Matched String
Out[67]:
In [63]:
matchall(r"That .*?,", declaration) # Return all matches of the Regex String
Out[63]:
You can use array notation to easily parse this array for the actual substrings (starting at Index 1):
In [83]:
matchall(r"That .*?,", declaration)[1]
Out[83]:
In [84]:
[println(matchall(r"That .*?,", declaration)[i]) for i in 1:length(matchall(r"That .*?,", declaration))] # Using list comprehension
Out[84]:
eachmatch() returns an object known as an iterator, specifically of the type RegexMatchIterator.
We have on and off encountered iterators, but we will not really deal with them in depth until later. Suffice it to say an iterator is an object that contains a list of items that can be iterated through.
The iterator will iterate over a list of RegexMatch objects, so if we want the results themselves, we will need to access the .match field of each of them:
In [87]:
eachmatch(r"That .*?,", declaration) # Returns a long iterator ready to iterate on the string
Out[87]:
In [88]:
for i in eachmatch(r"That .*?,", declaration) # For every match in the iterator
println("A matching search result is: $(i.match)") # Print the actual substring using i.match
end
The result is quite similar to that returned by matchall():
In [92]:
matchall(r"That .*?,", declaration)[1:2]
Out[92]:
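On Julia 1.x, where matchall() is not available, the same array of substrings can be built from eachmatch() with a comprehension. A minimal sketch under that assumption, with a made-up sample text:

```julia
# Assumes Julia 1.x: eachmatch() plus a comprehension stands in for matchall()
text = "That one, That two, and the rest."

all_matches = [m.match for m in eachmatch(r"That .*?,", text)]    # the matched substrings
all_offsets = [m.offset for m in eachmatch(r"That .*?,", text)]   # where each match starts
```

Since every element of the iterator is a full RegexMatch, other fields such as the offset are available in the same way.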
In [93]:
ismatch(r"truth(s)?", declaration)
Out[93]:
In [94]:
ismatch(r"sausage(s)?", declaration)
Out[94]:
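For readers on Julia 1.x, the equivalent yes-or-no check is occursin(), which takes the pattern first and the text second, just as ismatch() does. A short sketch under that assumption:

```julia
# Assumes Julia 1.x, where occursin() plays the role of ismatch()
text = "We hold these truths to be self-evident"

has_truth = occursin(r"truth(s)?", text)        # true
has_sausage = occursin(r"sausage(s)?", text)    # false
has_hold = occursin("hold", text)               # plain strings are accepted too
```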
In [97]:
replace(declaration, "truth", "sausage") # Update the Declaration for 2016
Out[97]:
An interesting feature of replace() is that the replacement does not need to be a string.
In fact, it is possible to pass a function as the third argument (as always, without the parentheses () that signify a function call). Julia will interpret this as 'replace the substring with the result of passing the substring to this function':
In [98]:
replace(declaration, "truth", uppercase) # Make sure people get the TRUTH of the Declaration
Out[98]:
Much more dignified than self-evident sausages, I'd say! At risk of repeating myself, it is important to note that since strings are immutable, replace() merely returns a copy of the string with the search string replaced by the replacement string or the result of the replacement function, and the original string itself will remain unaffected.
In [99]:
declaration # Unchanged / Immutable
Out[99]:
In [108]:
replace(declaration, "truth", x -> (x * " ") ^ 10) # We can use anonymous functions too; let's get more truths in here
Out[108]:
Where the substring is not found, the result will be, unsurprisingly, an unaltered string.
In [109]:
replace(declaration, "USSR", x -> (x * " ") ^ 10) # No match == No change
Out[109]:
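On Julia 1.x, replace() takes the search string and the replacement as a single pattern => replacement pair rather than as two separate arguments. The sketch below repeats the examples in that form on a shortened sample text:

```julia
# Assumes Julia 1.x, where replace() expects a pattern => replacement pair
text = "these truths"

with_string = replace(text, "truth" => "sausage")               # "these sausages"
with_function = replace(text, "truth" => uppercase)             # "these TRUTHs"
with_lambda = replace(text, "truth" => (x -> (x * " ") ^ 2))    # "these truth truth s"
no_match = replace(text, "USSR" => "sausage")                   # unchanged: "these truths"
```

As before, text itself is left untouched by all four calls.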
Flag | Function |
---|---|
i | Case-insensitive pattern matching |
m | Treats string as a multiline string, so that ^ and $ will refer to the start or end of any line within the string. |
s | Treats the string as a single line. This will result in . accepting a newline as well. When used together with m , it will result in . matching every possible character while still allowing ^ and $ to match just after and just before newlines within the string. |
x | Ignore non-backslashed, non-classed whitespace. |
Flags are appended to the end of each regex, which might strike users more familiar with e.g. the Pythonic way of modifying the regex search object itself, as somewhat unusual:
In [110]:
multiline = r"^We"m
Out[110]:
In this case, the regex r"^We" was augmented by the multiline flag, appended at its end.
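The effect of a flag is easiest to see by running the same pattern with and without it. This sketch uses a made-up two-line sample text, in which "We" only appears at the start of the second line:

```julia
text = "first line\nWe hold these truths"

without_flag = match(r"^We", text)     # nothing: ^ only matches at the start of the string
with_flag = match(r"^We"m, text)       # matches: with m, ^ also matches after a newline
insensitive = match(r"we"i, text)      # matches "We" despite the lower-case pattern
combined = match(r"^we"im, text)       # several flags can be appended together
```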
Case transformations are functions that act on Strings and transform character case. Let's examine the effect of these transformations in turn.
Function | Effect | Result |
---|---|---|
uppercase() | Converts the entire string to upper-case characters | WE HOLD THESE TRUTHS TO BE SELF-EVIDENT |
lowercase() | Converts the entire string to lower-case characters | we hold these truths to be self-evident |
ucfirst() | Converts the first character of the string to upper-case | We hold these truths to be self-evident |
lcfirst() | Converts the first character of the string to lower-case | we hold these truths to be self-evident |
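The table can be reproduced in a few lines. Note that on Julia 1.x, ucfirst() and lcfirst() go by the names uppercasefirst() and lowercasefirst(); the sketch below assumes that version and uses a shortened sample text:

```julia
# Assumes Julia 1.x names: uppercasefirst()/lowercasefirst() for ucfirst()/lcfirst()
text = "we hold these Truths"

upper = uppercase(text)                     # "WE HOLD THESE TRUTHS"
lower = lowercase(text)                     # "we hold these truths"
capitalised = uppercasefirst(text)          # "We hold these Truths"
decapitalised = lowercasefirst("We hold")   # "we hold"
```

Only the first character is touched by the last two functions; the capital T in "Truths" survives.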